53 research outputs found

    Exploring subdomain variation in biomedical language.

    BACKGROUND: Applications of Natural Language Processing (NLP) technology to biomedical texts have generated significant interest in recent years. In this paper we identify and investigate the phenomenon of linguistic subdomain variation within the biomedical domain, i.e., the extent to which different subject areas of biomedicine are characterised by different linguistic behaviour. While variation at a coarser domain level such as between newswire and biomedical text is well-studied and known to affect the portability of NLP systems, we are the first to conduct an extensive investigation into more fine-grained levels of variation. RESULTS: Using the large OpenPMC text corpus, which spans the many subdomains of biomedicine, we investigate variation across a number of lexical, syntactic, semantic and discourse-related dimensions. These dimensions are chosen for their relevance to the performance of NLP systems. We use clustering techniques to analyse commonalities and distinctions among the subdomains. CONCLUSIONS: We find that while patterns of inter-subdomain variation differ somewhat from one feature set to another, robust clusters can be identified that correspond to intuitive distinctions such as that between clinical and laboratory subjects. In particular, subdomains relating to genetics and molecular biology, which are the most common sources of material for training and evaluating biomedical NLP tools, are not representative of all biomedical subdomains. We conclude that an awareness of subdomain variation is important when considering the practical use of language processing applications by biomedical researchers.
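
    As a rough illustration of the kind of analysis described above, the sketch below clusters a handful of subdomains by simple per-subdomain feature profiles. It is not the authors' pipeline: the subdomain names, features and values are hypothetical, and SciPy's agglomerative clustering stands in for whatever clustering techniques were actually used.

    import numpy as np
    from scipy.cluster.hierarchy import fcluster, linkage
    from scipy.spatial.distance import pdist

    # Hypothetical feature profile per subdomain, e.g. normalised corpus
    # statistics such as [type/token ratio, passive rate, negation rate].
    subdomains = ["genetics", "molecular biology", "clinical medicine",
                  "surgery", "microbiology"]
    features = np.array([
        [0.82, 0.11, 0.45],
        [0.80, 0.12, 0.44],
        [0.55, 0.30, 0.62],
        [0.52, 0.33, 0.60],
        [0.70, 0.18, 0.50],
    ])

    # Agglomerative clustering over pairwise distances between the profiles,
    # cut into two clusters (e.g. a laboratory-like vs clinical-like split).
    tree = linkage(pdist(features, metric="euclidean"), method="average")
    labels = fcluster(tree, t=2, criterion="maxclust")

    for name, label in zip(subdomains, labels):
        print(f"{name}: cluster {label}")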

    Semantic Specialisation of Distributional Word Vector Spaces using Monolingual and Cross-Lingual Constraints

    We present Attract-Repel, an algorithm for improving the semantic quality of word vectors by injecting constraints extracted from lexical resources. Attract-Repel facilitates the use of constraints from mono- and cross-lingual resources, yielding semantically specialised cross-lingual vector spaces. Our evaluation shows that the method can make use of existing cross-lingual lexicons to construct high-quality vector spaces for a plethora of different languages, facilitating semantic transfer from high- to lower-resource ones. The effectiveness of our approach is demonstrated with state-of-the-art results on semantic similarity datasets in six languages. We next show that Attract-Repel-specialised vectors boost performance in the downstream task of dialogue state tracking (DST) across multiple languages. Finally, we show that cross-lingual vector spaces produced by our algorithm facilitate the training of multilingual DST models, which brings further performance improvements. Ivan Vulic, Roi Reichart and Anna Korhonen are supported by the ERC Consolidator Grant LEXICAL (number 648909). Roi Reichart is also supported by the Intel-ICRI grant: Hybrid Models for Minimally Supervised Information Extraction from Conversations.
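
    The fragment below is a much simplified sketch of attract-repel style specialisation, not the published algorithm: synonym pairs are pulled together, antonym pairs are pushed apart, and a regularisation step keeps each vector near its original distributional estimate. The margins, rates and toy vocabulary are assumptions.

    import numpy as np

    np.random.seed(0)
    vectors = {w: np.random.randn(4) for w in ["cheap", "inexpensive", "expensive"]}
    original = {w: v.copy() for w, v in vectors.items()}

    attract_pairs = [("cheap", "inexpensive")]  # e.g. synonyms from a lexicon
    repel_pairs = [("cheap", "expensive")]      # e.g. antonyms from a lexicon
    lr, reg, attract_margin, repel_margin = 0.1, 0.05, 0.6, 0.0

    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

    for _ in range(100):
        for a, b in attract_pairs:      # pull the pair together if not close enough
            if cos(vectors[a], vectors[b]) < attract_margin:
                step = lr * (vectors[b] - vectors[a])
                vectors[a] += step
                vectors[b] -= step
        for a, b in repel_pairs:        # push the pair apart if too similar
            if cos(vectors[a], vectors[b]) > repel_margin:
                step = lr * (vectors[b] - vectors[a])
                vectors[a] -= step
                vectors[b] += step
        for w in vectors:               # stay close to the original vector space
            vectors[w] += reg * (original[w] - vectors[w])

    print(cos(vectors["cheap"], vectors["inexpensive"]))  # similarity should rise
    print(cos(vectors["cheap"], vectors["expensive"]))    # similarity should fall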

    Intelligent Assistant Language Understanding On Device

    It has recently become feasible to run personal digital assistants on phones and other personal devices. In this paper we describe a design for a natural language understanding system that runs on device. In comparison to a server-based assistant, this system is more private, more reliable, faster, more expressive, and more accurate. We describe what led to key choices about architecture and technologies. For example, some approaches in the dialog systems literature are difficult to maintain over time in a deployment setting. We hope that sharing the lessons from our practical experience may help inform future work in the research community.

    Text Mining for Literature Review and Knowledge Discovery in Cancer Risk Assessment and Research

    Research in biomedical text mining is starting to produce technology which can make information in biomedical literature more accessible for bio-scientists. One of the current challenges is to integrate and refine this technology to support real-life scientific tasks in biomedicine, and to evaluate its usefulness in the context of such tasks. We describe CRAB – a fully integrated text mining tool designed to support chemical health risk assessment. This task is complex and time-consuming, requiring a thorough review of existing scientific data on a particular chemical. The relevant data covers human, animal, cellular and other mechanistic studies from various fields of biomedicine; it is highly varied and therefore difficult to harvest from literature databases by manual means. Our tool automates the process by extracting relevant scientific data from published literature and classifying it according to multiple qualitative dimensions. Developed in close collaboration with risk assessors, the tool allows users to navigate the classified dataset in various ways and to share the data with other users. We present a direct and user-based evaluation which shows that the technology integrated in the tool is highly accurate, and report a number of case studies which demonstrate how the tool can be used to support scientific discovery in cancer risk assessment and research. Our work demonstrates the usefulness of a text mining pipeline in facilitating complex research tasks in biomedicine. We discuss further development and application of our technology to other types of chemical risk assessment in the future.
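
    The fragment below illustrates, in heavily simplified form, the classification step such a pipeline performs: assigning study-type labels to abstracts so they can be browsed by qualitative dimension. It is not the CRAB system; the label set, example sentences and the TF-IDF plus logistic-regression setup are illustrative assumptions.

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    # Hypothetical training snippets labelled with one qualitative dimension
    # (study type); a real system would use expert-annotated data and more labels.
    texts = [
        "Rats were exposed to the compound for 13 weeks and tumour incidence recorded.",
        "A cohort of exposed workers was followed up for twenty years.",
        "The assay measured DNA strand breaks in cultured human lymphocytes.",
        "Mice received daily oral doses and liver histopathology was examined.",
    ]
    labels = ["animal", "human", "cellular", "animal"]

    classifier = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)),
                               LogisticRegression(max_iter=1000))
    classifier.fit(texts, labels)

    # Classify a new abstract sentence by study type.
    print(classifier.predict(
        ["Cultured cells showed chromosomal aberrations after treatment."]))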

    Annotating and Learning Compound Noun Semantics

    There is little consensus on a standard experimental design for the compound interpretation task. This paper introduces well-motivated general desiderata for semantic annotation schemes, and describes such a scheme for in-context compound annotation accompanied by detailed publicly available guidelines. Classification experiments on an open-text dataset compare favourably with previously reported results and provide a solid baseline for future research.
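
    A toy sketch of compound interpretation framed as in-context supervised classification, in the spirit of the experiments mentioned above but not the authors' setup; the relation inventory, features and training examples are invented for illustration.

    from sklearn.feature_extraction import DictVectorizer
    from sklearn.pipeline import make_pipeline
    from sklearn.svm import LinearSVC

    def features(modifier, head, sentence):
        # Simple lexical and sentence-context features for an in-context compound.
        feats = {"mod=" + modifier: 1, "head=" + head: 1,
                 "pair=" + modifier + "_" + head: 1}
        feats.update({"ctx=" + w: 1 for w in sentence.lower().split()})
        return feats

    train = [
        (("kitchen", "knife", "She kept the kitchen knife in a drawer."), "LOCATION"),
        (("steel", "knife", "The steel knife stayed sharp."), "MATERIAL"),
        (("bread", "knife", "He cut the loaf with a bread knife."), "PURPOSE"),
    ]
    X = [features(*compound) for compound, _ in train]
    y = [relation for _, relation in train]

    model = make_pipeline(DictVectorizer(), LinearSVC())
    model.fit(X, y)
    print(model.predict([features("plastic", "knife", "The picnic set had a plastic knife.")]))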